1 Getting ready

First we load the necessary packages.

library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1  
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0 
library(dada2) # Accurate, high-resolution sample inference from amplicon sequencing data, Bioconductor v1.18.0 
library(plyr) # Tools for Splitting, Applying and Combining Data, CRAN v1.8.6 
library(DT) # A Wrapper of the JavaScript Library 'DataTables', CRAN v0.17 
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3 
library(biomformat) # An interface package for the BIOM file format, Bioconductor v1.18.0 

# Set seed
set.seed(1910)

Set the path to the fastq files:

path <- here::here("data/raw/casava-18-paired-end-demultiplexed-run1")
head(list.files(path))
## [1] "AqFl1-002_S1_L001_R1_001.fastq.gz"  "AqFl1-002_S1_L001_R2_001.fastq.gz" 
## [3] "AqFl1-003_S10_L001_R1_001.fastq.gz" "AqFl1-003_S10_L001_R2_001.fastq.gz"
## [5] "AqFl1-005_S19_L001_R1_001.fastq.gz" "AqFl1-005_S19_L001_R2_001.fastq.gz"

Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl1-002" "AqFl1-003" "AqFl1-005" "AqFl1-007" "AqFl1-009" "AqFl1-011"

2 Inspect read quality

We start by visualizing the quality profiles of the forward reads:

plotQualityProfile(fnFs[1:2])

In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quantiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).

The forward reads are good quality. We will truncate the forward reads at position 260.

Now we visualize the quality profile of the reverse reads:

plotQualityProfile(fnRs[1:2])

The reverse reads are of significantly worse quality, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 188 where the quality distribution crashes.

Considerations: the reads must still overlap after truncation in order to merge them later! For less-overlapping primer sets, like the V1-V2 we used for the present study, the truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them.

3 Filter and Trim

Assign filenames for the filtered fastq files.

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

We’ll use standard filtering parameters: maxN = 0 (DADA2 requires no Ns), truncQ = 2, rm.phix = TRUE and maxEE = 2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequence from the forward and reverse reads by setting trimLeft = c(20, 18). Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(20, 18),
                     truncLen = c(260,188), maxN = 0, maxEE = c(2,2), truncQ = 2, 
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE) 
head(out)
##                                    reads.in reads.out
## AqFl1-002_S1_L001_R1_001.fastq.gz    146844     91434
## AqFl1-003_S10_L001_R1_001.fastq.gz   173709    111395
## AqFl1-005_S19_L001_R1_001.fastq.gz   197511    127471
## AqFl1-007_S28_L001_R1_001.fastq.gz   148087     89004
## AqFl1-009_S37_L001_R1_001.fastq.gz   156341     99848
## AqFl1-011_S46_L001_R1_001.fastq.gz   200487    126400

Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE = c(2,5)), and reducing the truncLen to remove low quality tails. Remember though, when choosing truncLen for paired-end reads we must maintain overlap after truncation in order to merge them later.

4 Learn the error rates

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

errF <- learnErrors(filtFs, multithread = TRUE)
## 100632960 total bases in 419304 reads from 4 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 109743840 total bases in 645552 reads from 6 samples will be used for learning the error rates.

It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:

plotErrors(errF, nominalQ = TRUE)

plotErrors(errR, nominalQ = TRUE)

The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.

5 Sample inference

We are now ready to apply the core sample inference algorithm to the data. By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. Here, we set pool = TRUE to retain the singletons for the downstream species richness estimation.

dadaFs <- dada(filtFs, err = errF, pool = TRUE, multithread = TRUE) 
## 65 samples were pooled: 7693112 reads in 693408 unique sequences.
dadaRs <- dada(filtRs, err = errR, pool = TRUE, multithread = TRUE)
## 65 samples were pooled: 7693112 reads in 1015042 unique sequences.

Inspect the dada-class object

dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 2434 sequence variants were inferred from 19535 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

6 Merge paired reads

We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

# Inspect the merger data.frame from the first sample
head(mergers[[1]])
##                                                                                                                                                                                                                                                                                                   sequence
## 1 CCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGGAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 2 CCTAATACATGCAAGTCGAGCGCTGGAAGCTAATTTCATCCCTTCGGGATGAAATTAGTGGAAAGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAATAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 3     CCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGG
## 4   CCTAATACATGCAAGTCGAGCGAGTGAAATTAACTGATCCCTTCGGGGTGACGTTAATGGATCTAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTGCCTGTAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAACACTTCGAACCACATGGTTTGAAGATGAAAGGCGGCGCAAGCTGTCACTTACAGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 5                 CCTAATACATGCAAGTCGAGCGGACGGATGGGAGCTTGCTCCCAGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTGCCCTATAGTTGGGGATAACTCCGGGAAACCGGGGCTAATACCGAATGATATAATTTAGCTCCTGCTAGATTGTTGAAAGATGGTTTACGCTATCGCTATAGGATGGGCCCGCGGCGCATTAGCTAGTTGGTGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
## 6 CCTAATACATGCAAGTCGAGCGCGGGAATCTCTTCTGATCCCTTCGGGGTGAAGAGAGTGGAACGAGCGGCGGACGGGTGAGTAACACGTGGGCAACCTACCTATAAGACTGGGATAACTCGCGGAAACGTGAGCTAATACCGGATAATACTTTTTATTGCATAGTAAGAAGTTTGAAAGGCGGCGTAAGCTGTCACTTATAGATGGACCCGCGGCGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGGG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1      5261       6       7    114         0      0      1   TRUE
## 2      4155       8       5    114         0      0      2   TRUE
## 3      3477      10      12    118         0      0      1   TRUE
## 4      3368      11       9    116         0      0      2   TRUE
## 5      2674      14      15    130         0      0      1   TRUE
## 6      2482      15      14    114         0      0      1   TRUE

Considerations: Most of reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: Did we trim away the overlap between the reads?

7 Construct sequence table

We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.

seqtab <- makeSequenceTable(mergers)

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.

dim(seqtab)
## [1]    65 12808

Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab))) %>% 
  as.data.frame() %>% 
  rename("seqence length" = Var1) %>% 
  datatable(options = list(
    columnDefs = list(list(className = 'dt-left', targets = c(0:2)))
    ))

Plot sequence length distribution

seqLen <- nchar(getSequences(seqtab)) %>% 
  as.data.frame() %>% 
  rename(seqLen = ".") 

ggplot(seqLen, aes(x = seqLen)) + 
  geom_histogram(binwidth = 1, alpha = 0.2, position = "identity", colour = "red") +
  geom_vline(aes(xintercept = mean(seqLen)), color = "blue", linetype = "dashed", size = 0.5) +
  scale_x_continuous(breaks = seq(round_any(min(seqLen), 10), 
                                round_any(max(seqLen), 10, f = ceiling), 
                                round_any((max(seqLen)-min(seqLen))/10, 10, f = ceiling))) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)), 
                     breaks = seq(0, 
                                round_any(max(table(seqLen)), 10, f = ceiling), 
                                round_any(max(table(seqLen))/10, 10, f = ceiling))) +
  labs(x = "sequence length (bp)", title = "Amplicon length distribution") +
  annotate("text", label = "mean length", x = mean(seqLen$seqLen)-2, y = max(table(seqLen)), hjust = 1, colour = "blue") +
  theme_bw() 

Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes, most of which fall between 240 and 311 with a long tail on the right. We’ll just leave them as they are.

8 Remove chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool = TRUE during the sample inference, we should use method = "pooled" for the chimera removal.

seqtab.nochim <- removeBimeraDenovo(seqtab, method = "pooled", multithread = TRUE, verbose = TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9385713
dim(seqtab.nochim)
## [1]   65 4371

The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.

Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.

9 Track reads through the pipeline

As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:

getN <- function(x) sum(getUniques(x))
stats <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(stats) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(stats) <- sample.names
datatable(stats)

Plot the sequences stats:

p_stats <- stats %>% 
  as.data.frame() %>%
  rownames_to_column("SampleID") %>%
  mutate_at(vars("filtered":"nonchim"), ~100*.x/input) %>% 
  mutate(input = 100) %>%
  gather(key = "step", value = "percent", -SampleID) %>%
  mutate(step = factor(step, levels = c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim"))) %>%
  ggplot(aes(x = step, y = percent, color = SampleID)) +
    geom_point() +
    geom_line(aes(group = SampleID)) +
    scale_y_continuous(breaks = 0:10*10) +
    labs(x = "", y = "Reads retained (%)") +
    theme_bw()

ggplotly(p_stats, tooltip = c("x", "y", "colour"))

Looks good! Except for the filtering step, we kept the majority of our raw reads.

Considerations: This is a great place to do a last sanity check. Outside of filtering, there should be no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.

10 Export data

Export feature table:

t(seqtab.nochim) %>%
  make_biom() %>%
  write_biom(here::here("data/intermediate/dada2/table-run1.biom"))

Export representative sequences:

uniquesToFasta(seqtab.nochim, fout = here::here("data/intermediate/dada2/rep-seqs-run1.fna"), 
               ids = colnames(seqtab.nochim))

11 Acknowledgements

The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.12). For more documentations and tutorials, visit the DADA2 website.

12 Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=nb_NO.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=nb_NO.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] biomformat_1.18.0 plotly_4.9.3      DT_0.17           plyr_1.8.6       
##  [5] dada2_1.18.0      Rcpp_1.0.6        forcats_0.5.1     stringr_1.4.0    
##  [9] dplyr_1.0.5       purrr_0.3.4       readr_1.4.0       tidyr_1.1.3      
## [13] tibble_3.1.0      ggplot2_3.3.3     tidyverse_1.3.0   here_1.0.1       
## [17] knitr_1.31       
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-0            hwriter_1.3.2              
##   [3] ellipsis_0.3.1              rprojroot_2.0.2            
##   [5] htmlTable_2.1.0             XVector_0.30.0             
##   [7] GenomicRanges_1.42.0        base64enc_0.1-3            
##   [9] fs_1.5.0                    rstudioapi_0.13            
##  [11] farver_2.1.0                fansi_0.4.2                
##  [13] lubridate_1.7.10            xml2_1.3.2                 
##  [15] codetools_0.2-18            splines_4.0.3              
##  [17] Formula_1.2-4               jsonlite_1.7.2             
##  [19] Rsamtools_2.6.0             broom_0.7.5                
##  [21] cluster_2.1.0               dbplyr_2.1.0               
##  [23] png_0.1-7                   compiler_4.0.3             
##  [25] httr_1.4.2                  backports_1.2.1            
##  [27] lazyeval_0.2.2              assertthat_0.2.1           
##  [29] Matrix_1.3-2                cli_2.3.1                  
##  [31] htmltools_0.5.1.1           tools_4.0.3                
##  [33] gtable_0.3.0                glue_1.4.2                 
##  [35] GenomeInfoDbData_1.2.4      reshape2_1.4.4             
##  [37] ShortRead_1.48.0            Biobase_2.50.0             
##  [39] cellranger_1.1.0            jquerylib_0.1.3            
##  [41] rhdf5filters_1.2.0          vctrs_0.3.6                
##  [43] Biostrings_2.58.0           debugme_1.1.0              
##  [45] crosstalk_1.1.1             xfun_0.22                  
##  [47] ps_1.6.0                    rvest_1.0.0                
##  [49] lifecycle_1.0.0             zlibbioc_1.36.0            
##  [51] scales_1.1.1                hms_1.0.0                  
##  [53] MatrixGenerics_1.2.0        parallel_4.0.3             
##  [55] SummarizedExperiment_1.20.0 rhdf5_2.34.0               
##  [57] RColorBrewer_1.1-2          yaml_2.2.1                 
##  [59] gridExtra_2.3               sass_0.3.1                 
##  [61] rpart_4.1-15                latticeExtra_0.6-29        
##  [63] stringi_1.5.3               highr_0.8                  
##  [65] S4Vectors_0.28.0            checkmate_2.0.0            
##  [67] BiocGenerics_0.36.0         BiocParallel_1.24.1        
##  [69] GenomeInfoDb_1.26.0         rlang_0.4.10               
##  [71] pkgconfig_2.0.3             bitops_1.0-6               
##  [73] matrixStats_0.58.0          evaluate_0.14              
##  [75] lattice_0.20-41             Rhdf5lib_1.12.0            
##  [77] labeling_0.4.2              GenomicAlignments_1.26.0   
##  [79] htmlwidgets_1.5.3           tidyselect_1.1.0           
##  [81] magrittr_2.0.1              R6_2.5.0                   
##  [83] IRanges_2.24.0              generics_0.1.0             
##  [85] Hmisc_4.5-0                 DelayedArray_0.16.0        
##  [87] DBI_1.1.1                   pillar_1.5.1               
##  [89] haven_2.3.1                 foreign_0.8-81             
##  [91] withr_2.4.1                 survival_3.2-7             
##  [93] RCurl_1.98-1.3              nnet_7.3-15                
##  [95] modelr_0.1.8                crayon_1.4.1               
##  [97] utf8_1.2.1                  rmarkdown_2.7              
##  [99] jpeg_0.1-8.1                grid_4.0.3                 
## [101] readxl_1.3.1                data.table_1.14.0          
## [103] reprex_1.0.0                digest_0.6.27              
## [105] RcppParallel_5.0.3          stats4_4.0.3               
## [107] munsell_0.5.0               viridisLite_0.3.0          
## [109] bslib_0.2.4